Abstract. In the Korean culture the family members are recorded in special family 
books. This makes it possible to follow the distribution of Korean family names far back 
in history. It is here shown that these name distributions are well described by a simple 
null model, the random group formation (RGF) model. This model makes it possible 
to predict how the name distributions change and these predictions are shown to be 
borne out. In particular, the RGF model predicts that, for married women entering a 
collection of family books in a certain year, the occurrence of the most common family 
name "Kim" should be directly proportional to the total number of married women, with 
the same proportionality constant for all the years. This prediction is also borne out to 
a high degree. We speculate that it reflects some inherent social stability in the Korean 
culture. In addition, we obtain an estimate of the total population of the Korean 
culture down to year 500 AD, based on the RGF model and find about ten thousand 
Kims. 
1. Introduction 

Your family name is very important in the Korean culture. Abandoning your family 
name is extremely unusual and considered dishonorable. A common metaphor is to 
pledge your family name to a given promise. The importance is also reflected in the 
fact that a married woman carries the family name of the family she comes from. In 
addition, the Confucian tradition has encouraged a family to record the genealogical 
tree in special books, which then record the women's family names flowing into the 
family book by marriages. The children inherit the name of the father. Some of the 
family books go back more than 500 years. 

The distribution of family names is often described in terms of the probability 
$P(k)$ to randomly pick a family name which occurs $k$ times within the population. 
The frequency distributions $P(k)$ for family names are examples of broad 'fat-tailed' 
distributions which, at least crudely, can be described by power laws $P(k) \propto 1/k^{\gamma}$, 
as has been studied before [1, 2, 3]. In particular, Korean family names have a 
very broad distribution with a $\gamma$ close to one [3]. It is common to try to connect 
the approximate power-law form of family distributions to growth models of the total 
population [1, 2, 3, 5]. Such models usually yield power laws with approximately the 
Zipf's law exponent $\gamma = 2$, i.e., $P(k) \propto 1/k^{2}$. However, this does not describe the 
Korean family distribution, which has a much slower fall-off at large $k$. It has been 
suggested that this is because the rate of introducing new family names in Korea is very 
slow [3, 5]. In accordance with this, it was shown in [5] that a growth model with 
a rate of introduction of family names which approaches zero can indeed yield a power law 
with $\gamma = 1$ instead of $\gamma = 2$. 

In [6], it was shown that system-specific growth models are usually too restrictive to 
catch some of the essential characteristic features of frequency distributions. Examples 
are the dependence of $\gamma$ on the size of the data set and the connection between the 
data-set size and the size of the largest frequency. In the present paper, we reinvestigate the 
historical Korean family books from this particular perspective. We use a collection of 
Korean family books to estimate the change in the frequency distribution of family 
names for the last five hundred years. We then compare these changes with the 
predictions of the Random Group Formation (RGF) model introduced in [6]. The 
RGF model assumes maximal mixing for each given data size and it is shown that the 
predictions from this model are borne out to a striking degree. In particular, it is found 
that the proportion of persons named 'Kim' is constant irrespective of all social changes, 
wars, earthquakes, famines, plagues, fertility variations, industrial revolution, etc. and 
this constancy is also an inherent feature of the RGF model. 

In section 2 we describe how we use the data from the Korean family books and 
in section 3 we give a brief recapitulation of the Random Group Formation model. 
The comparison between data and theoretical predictions is given in section 4, while 
section 5 contains some concluding remarks. 
Table 1. Statistical quantities of the family books analyzed in this work. Here $M$ 
means the total number of women entering this family by marriage during the specified 
period. $N$ means the number of different family names that the women carried. Among 
these $N$ different names, we find the one with the largest number of carriers and denote 
this number by $k_{\max}$. 
2. Korean Family Books 

Our data is extracted from the ten Korean family books which were also analyzed in 
[?]. The data we extract in the present investigation is the total number M of married 
women, who were registered with marriage year into these ten books during a specific 
30-year period between 1510 and 1990. For each period, the number of different family 
names $N$ and the number of women having the most common family name (usually 
'Kim'), $k_{\max}$, are also extracted. The results are given in table 1, which contains sixteen 
historical windows. This data set is analyzed and compared with census data for the 
whole Korean population from year 2000 (see [1]). 

As in [7], we argue that the statistics of this collection of women's family names 
should bear a strong resemblance to those of the whole population during the same period 
in the following sense: suppose that out of the whole population of women who got 
married during a certain period, you have randomly selected $M$ women. There is a certain 
chance that a chosen woman has a family name which occurs $k$ times among the $M$ 
picked women. Suppose that the probability distribution for the frequency of different 
family names within a group of $M$ randomly selected women is $P_M(k)$. Then the 
$M$ selected women on the average have family names which occur $M/N$ times since 
$\sum_{k=1}^{k_{\max}} k P_M(k) = \langle k \rangle = M/N$, where $k_{\max}$ is the number of carriers of 
the most frequent name. The data for 
an entry in table 1 corresponds to a single try of choosing $M$ women out of the whole 
collection of women who got married during the period. Suppose that you have now 
instead picked $M$ random persons out of the whole population. Then, provided that the 
married women were really randomly distributed over the population, the result would 
be statistically the same: the probability distribution for the frequency of different 
family names within a group of $M$ randomly selected persons would again be given by 
$P_M(k)$. 



There are two important points to notice in this context. The first is that $P_M(k)$ 
for $M$ randomly selected persons does not have the same functional form as $P_{M_{\rm tot}}(k)$ for 
the complete population $M_{\rm tot}$ [6]. In other words, the family-name distribution depends 
on the size $M$. The second is that $M_{\rm tot}(t)$ depends on time $t$ in an unknown (but 
presumably rather nontrivial) way reflecting the history of the Korean people. There is, 
at least a priori, no obvious relation between the inflow of married women into the ten 
specific family-name books on one hand, and the total population of the Korean people 
on the other: A prosperous family could have a large inflow of married women even in 
periods when the total population decreases. 
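
The bookkeeping behind $P_M(k)$ in this section can be illustrated with a minimal Python sketch. The function name `name_frequency_distribution` and the toy list of names are assumptions made for illustration only; they are not data from the family books. The sketch counts carriers per name, builds the fraction of names with exactly $k$ carriers, and checks that the mean of that distribution equals $M/N$.

```python
from collections import Counter

def name_frequency_distribution(names):
    """Return (M, N, k_max, P) for a list of family names.

    P maps k -> fraction of distinct names carried by exactly k persons.
    """
    carriers = Counter(names)               # name -> number of carriers
    M = len(names)                          # number of persons
    N = len(carriers)                       # number of distinct names
    k_max = max(carriers.values())          # carriers of the most common name
    sizes = Counter(carriers.values())      # k -> number of names with k carriers
    P = {k: count / N for k, count in sizes.items()}
    return M, N, k_max, P

# Toy sample with made-up counts (hypothetical, for illustration only).
sample = ["Kim"] * 5 + ["Lee"] * 3 + ["Park"] * 3 + ["Choi"] * 1
M, N, k_max, P = name_frequency_distribution(sample)
mean_k = sum(k * p for k, p in P.items())   # sum_k k P_M(k)
print(M, N, k_max, mean_k, M / N)
```

For any input list the last two printed numbers agree, since summing $k$ over all names counts every person exactly once.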

3. The RGF model 

The random group formation model tries to catch the essential features of the group-size 
distribution when M objects are divided into N groups by assuming optimal mixing [6]. 
In the case of family names, the persons are the objects and the groups are formed 
by the persons carrying the same family name. The RGF model does not make any 
explicit assumption about what particular process is responsible for the creation of the 
groups. Instead it is assumed that, whatever this process might be, the result is that 
the optimal mixing condition is on the average approximately fulfilled at all times. 
The optimal mixing condition corresponds to a maximum entropy condition for the 
group-size distribution $P_M(k)$. The appropriate maximum condition can be formulated 
in terms of a maximum mutual-information principle or equivalently as a minimum 
information-cost condition [8, 6]. The result is a distribution function of the explicit 
form 

$$P_M(k) = A \frac{\exp(-bk)}{k^{\gamma}}, \qquad (1)$$

where $P_M(k)$ obeys the following set of self-consistent equations: 

$$\sum_{k=k_0}^{M} P_M(k) = 1, \qquad \sum_{k=k_0}^{M} k P_M(k) = \langle k \rangle = M/N,$$

where $k_0$ is the size of the smallest group. In the present investigation, this limit is 
always $k_0 = 1$, but in other applications it can be generalized to an arbitrary $k_0$. This 
means that the constants $\gamma$ and $b$ are interdependent: for given $M$ and $N$, eliminating 
$A$ between the two conditions leaves a single relation between them. 
The self-consistency condition connects the entropy of $P_M(k)$ to the size of the largest 
group $k_{\max}$, through the relation 

$$\langle k_{\max} \rangle = N \sum_{k=k_c}^{M} k P_M(k), \qquad (3)$$

where $\langle k_{\max} \rangle$ is the average size of the largest group and the value of $k_c$ is determined 
such that there is on the average only a single group in the interval $[k_c, M]$, i.e., 
$\sum_{k=k_c}^{M} P_M(k) = 1/N$, which means that (3) gives the average $\langle k_{\max} \rangle$. When analyzing 
data we will approximate $k_{\max}$ for the data with the average size of the largest group 
$\langle k_{\max} \rangle$ obtained for the RGF model. This set of self-consistent equations yields a unique 
solution $P_M(k)$ of the RGF form for given values of $M$, $N$, $k_{\max}$, and $k_0$ [6]. Suppose 
you have a collection of $M$ persons each carrying one out of $N$ different family names 
which are distributed according to the frequency distribution $P_M(k)$. What is the 
corresponding frequency distribution $P_m(k)$ for $m$ persons randomly picked from the 
original total $M$? In the case that the persons are picked randomly, $P_m(k)$ is given by 
the transformation 

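$$P_m(k) = \frac{\sum_{k'=k}^{M} \left(\frac{m}{M}\right)^{k} \left(\frac{M-m}{M}\right)^{k'-k} \binom{k'}{k} P_M(k')}{1 - \sum_{k'=1}^{M} \left(\frac{M-m}{M}\right)^{k'} P_M(k')},$$

where $\binom{k'}{k}$ is the binomial coefficient. In the context of word-frequency of books, 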
this transformation is sometimes referred to as the Random Book Transformation 
(RBT) [9, 10]. The crucial point is that, if you start with a certain $P_M(k)$ of the 
RGF form of (1), then all of $\gamma$, $b$, and $A$ will change with $m$: a random reduction of 
a data set is not scale-invariant with respect to the frequency distribution and as a 
consequence, the power-law index $\gamma$ changes [9, 10, 6]. Another characteristic feature 
of the transformation is that the number of persons in the largest family group is 
proportional to the size of the data set [6]. 
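
The transformation above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `random_book_transformation` and the toy input distribution are hypothetical, and the distribution is stored as a dictionary mapping a group size to the fraction of names with that size. Each of the $k'$ carriers of a name is kept with probability $m/M$, and names that lose all carriers are removed by the normalisation.

```python
from math import comb

def random_book_transformation(P_M, M, m):
    """Binomial thinning of a group-size distribution from size M to size m.

    P_M maps k' -> fraction of names with k' carriers in the full data set.
    Returns P_m mapping k -> fraction of surviving names with k carriers.
    """
    p = m / M                # probability that a given person is kept
    q = 1.0 - p              # probability that a given person is dropped
    raw = {}
    for k_full, weight in P_M.items():
        for k in range(1, k_full + 1):
            # probability that exactly k of the k_full carriers are kept
            prob = comb(k_full, k) * p**k * q**(k_full - k)
            raw[k] = raw.get(k, 0.0) + prob * weight
    # Names with no surviving carrier drop out; renormalise over k >= 1.
    survive = 1.0 - sum(q**k_full * w for k_full, w in P_M.items())
    return {k: v / survive for k, v in raw.items()}

# Toy distribution (hypothetical): N = 4 names, M = 8 persons, mean size 2.
P_M = {1: 0.5, 2: 0.25, 4: 0.25}
P_m = random_book_transformation(P_M, M=8, m=4)
print(P_m)
```

Setting `m` equal to `M` returns the input distribution unchanged, which is a convenient sanity check on the normalisation.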

In order to predict the change of the RGF function under the data-size reduction, 
we first numerically calculate $\langle k \rangle_m = \sum_{k \ge 1} k P_m(k)$ using the RBT and 
then use the RGF self-consistent equations for the input values $m$, $n = m/\langle k \rangle_m$, and 
$k_{\max}(M)\, m/M$, to obtain the corresponding $P_m(k)$ of the form of (1). In this way, one 
obtains a prediction for $P_m(k)$ starting from a known $P_M(k)$ when $m < M$. It is also 
possible to get predictions for an increase of the data set (i.e. $m > M$) in a similar 
way. However, in the present paper, we will, when predicting the increase of a data set, 
resort to two approximate relations found in [6] 

^ ~ 1 K toFM' < 4 > 

describing the expected approach to the large-$M$ limit together with the approximate 
relation $b \propto 1/M$. 
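
The data-size reduction step described above needs three input values for the reduced set: its size, its expected number of names, and its largest group size. A minimal sketch of how these could be computed is given below. The function name `reduced_inputs` and the toy numbers are assumptions for illustration. Instead of building the full transformed distribution, the sketch uses an equivalent shortcut under binomial thinning: a name with $k$ carriers survives unless all $k$ carriers are dropped, which gives the expected number of surviving names directly.

```python
def reduced_inputs(P_M, M, N, k_max, m):
    """Input values (n, mean_k, k_max_m) for a random reduction from M to m persons.

    P_M maps k -> fraction of the N names that have k carriers.
    """
    q = 1.0 - m / M                           # probability that a person is dropped
    # Fraction of names that keep at least one carrier.
    survive = sum(w * (1.0 - q**k) for k, w in P_M.items())
    n = N * survive                           # expected number of surviving names
    mean_k = m / n                            # average group size in the reduced set
    k_max_m = k_max * m / M                   # largest group scales with data-set size
    return n, mean_k, k_max_m

# Toy example (hypothetical numbers): 4 names, 8 persons, reduced to 4 persons.
P_M = {1: 0.5, 2: 0.25, 4: 0.25}
n, mean_k, k_max_m = reduced_inputs(P_M, M=8, N=4, k_max=4, m=4)
print(n, mean_k, k_max_m)
```

The three returned values play the roles of $n$, $\langle k \rangle_m$ and $k_{\max}(M)\,m/M$ in the text; solving the self-consistent equations for them is a separate numerical step that is not attempted here.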

Earlier attempts to explain family-name distributions have usually been connected 
to explicit assumptions about the time evolution linked to the growth of the population 
[1, 2, 3, 5]. One may then ask how the time evolution enters the RGF model. The answer is 
that it only enters indirectly; the RGF model is in itself history-independent and at each 
time only depends on the instantaneous input parameters. However, one of the input 
parameters is the size of the data set $M$. Suppose that $M$ is the total population, then 
this parameter is indeed history- and time-dependent. One might even suspect that it 
has a complicated time dependence M(t) reflecting changes due to wars, earthquakes, 
famines, plagues, fertility variations, industrial revolution etc. The point is that the 
RGF model assumes that, whatever this actual historical time dependence might be, 
the resulting family-name frequency distribution to good approximation on the average 
is given by the maximal mixing condition which only depends on the instantaneous 
value of M(t). 

A parallel example of this is provided by the word-frequency distribution of novels 
written by an author [10, 6]: no matter what size the novel has or when it was written, 
to good approximation the word-frequency distribution for a novel of an author only 
depends on the number of words it contains [10]. This size dependence is to very good 
approximation given by the RGF model [6]. The fact that the word frequency of an 
author to good approximation only depends on the size of the text is equivalent to 
characterizing an author by a single very large "meta-book" from which the average 
frequency distribution for any text size written by the author can be obtained [10]. 

4. Analysis 

In order to test the RGF model, we start with the three most recent entries in 
table 1, which span the time period 1900-1990. These three entries together contain 
$M_{1900-90} = 165020$ women, each having one family name out of $N_{1900-90} = 194$. Out of 
this data set, we randomly select $M < M_{1900-90}$ women and calculate the average $N$ 
for each $M$. The full curve in figure 1 shows the average $N(M)$. This random-selection 
prediction based on the specific period 1900-1990 is compared to the data covering all 
the time periods from 1500-1990. If the women flowing into the entries by marriage are 
statistically equivalent to selecting random persons in the total population and if the 
population was static in time, then the agreement between the data and the random 
selection would be easy to understand. However, the latter assumption is of course not 
true: both the total population $M_{\rm tot}(t)$ and the number of different names $N_{\rm tot}(t)$ change 
in response to historical developments. From about 1500 to 2000, the population in 
Korea increased by roughly a factor of 6, from about 8 million to 46 million. This increase 
is linear in time up to 1900 and then becomes faster (compare figure 2). During the 
same period, the number of family names only increased very slowly, again with a sharp 
increase around 1900. In the rough estimate given below, the increase is about 27 names 
from around 1500 to 2000. In spite of this, the results shown in figure 1 suggest that 
this time evolution is such that $N(t)$ depends on time $t$ only through $M(t)$, such that 
at all times $N(M)$ is a unique function. The uniqueness of the function $N(M)$ is also 
a consequence of the RGF model. In the context of word frequencies, it corresponds 
to the meta-book concept discussed in [10, 6]: the word frequency used by an author 
Figure 1. Comparison between the expected number of different family names and 
data. The full drawn curve is the number of different family names $N$ as a function 
of the number of women $M$, when $M$ women are randomly chosen from all the women 
recorded in the ten books given in table 1 during the period 1900-1990. The curve 
is the average over $10^2$ random choices, and the dotted lines show three standard 
deviations. The three open circles are the explicit data from table 1 for the three 
periods 1900-1930, 1930-1960, and 1960-1990. The crosses represent the remaining 
historical data, for which $M$ decreases monotonically going back in time. The data is consistent with 
a random drawing of persons from a time-independent distribution. 

Figure 2. Historical estimates of the Korean population in [11] (crosses). The 
rightmost point indicates the census data in year 2000. 

is to good approximation given by a unique function N(M), where N is the number of 
distinct words and M is the total number of words, independent of which book, what 
length the book has, or when it was written. In short, figure 1 suggests that $N$ is only 
a function of M(t) and this particular feature is also consistent with the RGF model. 
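
The random-selection curve $N(M)$ can be estimated by simple Monte Carlo sampling. The sketch below is an illustration under stated assumptions: the function name `distinct_names_curve` is hypothetical and the toy population uses made-up counts, not the family-book data. For each sample size it draws persons without replacement, counts the distinct names, and reports the mean and standard deviation over 100 trials, the same number of random choices quoted for figure 1.

```python
import random
from statistics import mean, stdev

def distinct_names_curve(population, sizes, trials=100, seed=0):
    """Mean and standard deviation of the number of distinct names in random samples."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    curve = {}
    for M in sizes:
        # Draw M persons without replacement and count distinct family names.
        counts = [len(set(rng.sample(population, M))) for _ in range(trials)]
        curve[M] = (mean(counts), stdev(counts))
    return curve

# Toy population with made-up counts (hypothetical, 500 persons, 7 names).
population = (["Kim"] * 200 + ["Lee"] * 150 + ["Park"] * 80 + ["Choi"] * 40
              + ["Jung"] * 20 + ["Kang"] * 8 + ["Cho"] * 2)
curve = distinct_names_curve(population, sizes=[10, 50, 200, len(population)])
for M, (avg, sd) in curve.items():
    print(M, avg, sd)
```

Plotting the mean with a band of three standard deviations against the sample size gives a curve of the kind used for the comparison in figure 1.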

In order to illustrate the consequences of a time-independent distribution further, 
we use the data from [5] for the introduction years of 189 families. Figure 3(a) shows 
the number of these 189 families which existed at a given time from year 500 to 2000. 
Note that the increase is slow: from year 1500 to 2000 the increase is only about 20%. 
Now imagine that you pick a group of male persons who belong to one of these 189 
families in year 2000. If we follow the lineage of this group of $m$ persons back in time, 
the only decrease in the number of family names is caused by the fact that the lineage 
stops. Suppose that you randomly pick a fraction $m/M$ of the population for year 
2000, which on the average contains 189 different names. According to the family books 
in 1900-1990 (table 1), this number amounts to $m \approx 1.4 \times 10^5$ (figure 1). Roughly a 
half of this group is male and if we follow the lineage backward in time, the remaining 
lineage will be roughly proportional to the total population so that $m(t)/M(t) = {\rm const}$. 
Figure 3. (a) Number of family names that are known to be introduced in Korea until 
a certain year (solid) and our predicted number of them (crosses), which is obtained 
by using the historical estimates (figure 2) and the family books in 1900-1990 (see 
text). (b) Our predicted size of population in the past (solid), based on the number 
of people in the same family books, carrying $N_f$ different names shown in (a). The 
crosses are the same historical estimates as in figure 2. Using this prediction as the 
input parameter of the RGF description [see Q], we also estimate the total number 
of family names (dotted). 



As the population decreases, the average number of names, $n$, is hence just given by 
$n(m)$ provided that this to good approximation is a unique function independent of 
time. In figure 3(a), the crosses show the prediction from this assumption, using the 
information on the total population given in figure 2 and the estimated $N(M)$ given 
in figure 1. As seen here, both the sharp decrease from 2000 to 1900 and the slow 
decrease between 1900 and 1400 are correctly reproduced. In figure 3(b), we do the inverse, 
using the same assumption: the data for the 189 families given in figure 3(a) is used as 
an input to estimate the total population within the period 500-2000. The full curve 
in figure 3(b) gives the estimated population by this method. Comparison with the 
historical estimates from [11] shows that the agreement is everywhere within a factor 
of 2 [see figure 3(b)]. The estimate suggests that the population at about 500 AD 
was around $4 \times 10^4$ and exceeded $10^7$ around 1400 AD. It should be noted that our 
estimate of the total population refers to all persons integrated in the society having a 
Korean family name. This might, of course, in earlier times only be part of the actual 
population controlled by the society. Figure 3(b) also shows the expected number of 
family names based on the prediction for the total population possessing Korean family 
names. According to this estimate there were already around 150 family names around 
year 500 AD. Note that the expected number also reflects the fact that family names can 
both be added and subtracted from the population in the course of history (some families 
simply go extinct). Thus the rate of increase for the total population is expected to 
be slower than for the 189 families which only include families which have survived until 
year 2000. This observation is also consistent with our prediction: 155 names are 
introduced during the period 500-2000 according to our data set plotted in figure 3(a), 
and our predicted increase of the total number during the same period also happens to 
be 155. 

A second feature of the RGF model is that the largest group is always proportional 
to the total size of the data set [5]. In the context of Korean family names, this implies 
that the proportion of persons named 'Kim' in a randomly picked group of Koreans 
should be constant, irrespective of historical time or size of the group. In order to 
test this, we again start from the three latest historical windows in table 1, i.e., 
1900-1930, 1930-1960, and 1960-1990. Note that the period 1930-1960 is the largest group, so that 
$M$ actually decreases with time during some period within 1930-1990. These three data 
points are plotted in figure 4 (open circles) and the straight line in the figure is obtained 
by a least-squares fit to these three data points. This linear prediction, based on 
the data from 1900-1990, is then directly compared to the data for the thirteen time 
windows from 1510 to 1900. The agreement is striking. In addition, the number of 
Kims from the census of year 2000 is given by the asterisk. Thus the figure spans group 
sizes from $M = 33$ to $M = 4.6 \times 10^7$ and history from year 1510 to 2000. From this 
perspective, the proportionality is borne out to an amazing degree. Is this a surprising 
result? All it is saying is that the total number of persons named Kim grows and 
decreases at precisely the same rate as the total population. If the number of persons 
belonging to any family grew and decreased precisely as the total population, then the 
result would follow. However, this is not the case. The ten families in table 1 show a rapid 
increase of inflow of women by marriage from 1510 to 1900. The number of in-flowing 
women is an approximate measure of the total number of persons who carry one of the 
ten names. Figure 5 shows this rapid growth: a factor of about 60 to be compared to 
a factor of about 2 for the total population during the same period. Thus individual 
name groups, in general, grow and decrease in a very different way compared to the 
total population. Nor do growth models in general predict that the largest group grows 
linearly with the size. For example, the Simon model predicts that the largest group 
grows like $k_{\max} \propto M^{1-\alpha}$, where $0 < \alpha < 1$ is the probability for a new name appearing 
during a time step [12, 6]. This means that only in the trivial case when the whole 
population is named Kim (which corresponds to $\alpha = 0$) is there a direct proportionality. 
Thus, in spite of the simplicity of the result, it appears to be nontrivial. Nevertheless, 
it is consistent with the prediction of the RGF model. We also note that combining our 
population estimate of persons with Korean family names with the size proportionality 
of Kim suggests that at around 500 AD there were already about 10000 Koreans named 
Kim. 
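
The arithmetic behind the proportionality argument can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the fit in figure 4: the three separate windows of table 1 are not available here, so the slope is computed from the pooled 1900-1990 values quoted in the text (165020 women, 32316 of them named Kim). The function names and the Simon-model exponent `alpha = 0.1` are hypothetical choices made only to contrast linear growth with growth as $M^{1-\alpha}$.

```python
def fit_through_origin(points):
    """Least-squares slope a for k_max = a * M, given (M, k_max) pairs."""
    num = sum(M * k for M, k in points)
    den = sum(M * M for M, _ in points)
    return num / den

def simon_largest_group(M, alpha, M_ref, k_ref):
    """Largest group if k_max grows like M**(1 - alpha), pinned at (M_ref, k_ref)."""
    return k_ref * (M / M_ref) ** (1.0 - alpha)

# Pooled 1900-1990 values from the text; with one point the fit is just the ratio.
pooled = [(165020, 32316)]
a = fit_through_origin(pooled)              # about 0.196

census_size = 4.6e7                         # population in year 2000 (from the text)
linear_kims = a * census_size               # RGF-style proportional prediction
simon_kims = simon_largest_group(census_size, alpha=0.1,   # alpha is illustrative
                                 M_ref=165020, k_ref=32316)

# Kims implied by the two population estimates quoted for around 500 AD.
for population_500 in (4e4, 5e4):
    print(population_500, round(a * population_500))
print(a, linear_kims, simon_kims)
```

With a slope of about 0.196, a population of a few times $10^4$ gives of the order of ten thousand Kims, while any $\alpha > 0$ in the Simon-type scaling yields fewer Kims than the proportional prediction at the census size.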

So far, we have only discussed two features of the RGF model: the uniqueness 
and time independence of the function $N(M)$ and the proportionality between the 
size of the largest group and the size of any random part of a population, with a 
time-independent proportionality constant. However, the RGF model also gives a precise 
prediction for the actual 
group-size distribution $P_M(k)$ in the form of (1). In order to test this prediction, we again 
start from the data in table 1 and the period 1900-1990, which contains $M_{\rm tot} = 165020$ 
women each having one family name out of $N_{\rm tot} = 194$ and where $k_{\max} = 32316$ are 
named Kim. These three numbers $M_{\rm tot}$, $N_{\rm tot}$ and $k_{\max}$ uniquely determine $P_M(k)$ within 
the RGF model, as explained in section 3 [6]. The middle full curve in figure 6 gives 
the predicted size distribution and the pluses denote the actual (binned) data points. 
Figure 4. The number of persons with the most frequent family name in each family 
book. The circles on the upper right side show the three most recent data sets of 
table 1, from which the slope $a$ of the line has been determined by linear fitting, 
$k_{\max} = aM$. Note that these three points are not in time-order since the middle point 
is the latest. The crosses are the remaining thirteen time windows for which time and 
size are in the same order. The asterisk is the number of Kims for the whole population 
according to the census in year 2000. The proportionality between the size of the group 
and the number of Kims is borne out to a very high degree irrespective of time. 



The agreement between the RGF prediction and the data is very good, in particular in 
view of the fact that the prediction is based solely on the three numbers $M_{\rm tot}$, $N_{\rm tot}$ and 
$k_{\max}$. The prediction for the exponent $\gamma$ in (1) is $\gamma = 1.12$. As explained in section 3, 
the RGF model allows you to predict $P_m(k)$ for either a smaller, $m < M$, or a 
larger, $m > M$, data set. Figure 7 displays the predicted change of $\gamma$ when starting from the data 
given for the period between 1900-1990. The solid curve gives the change when $m$ is 
decreasing and the dotted line as it is increasing. The fact that $\gamma$ changes with the size 
of the data set is a fundamental feature of the RGF model and distinguishes it from the 
usual growth models which in general give scale-invariant and hence size-independent 
$\gamma$ [6]. The left full curve in figure 6 is the prediction for the 1600-1630 data only using 
the data for 1900-1990 and the number $m = 384$, which is the number of women 
getting married into the ten families during the period 1600-1630. The actual 
name-frequency distribution for these women is given by the crosses and the agreement is 
again quite good. In particular, note that the data is indeed consistent with the slightly 
steeper slope for smaller $k$ caused by a slightly larger $\gamma = 1.22$ (compare figure 7). The 
rightmost curve in figure 6, in the same way, gives the prediction based on only using 
the data for the married women in 1900-1990 and the total population size in year 
2000 given by $m = 4.6 \times 10^7$. The census data from year 2000 is also plotted and the 
agreement between prediction and the data is again very good. This time $\gamma = 1.07$ 
is even closer to one. In short, the RGF model gives very good predictions for the 
distribution of actual Korean name-group sizes both backward and forward in time. 

5. Concluding Remarks 



Our analysis suggests that the family-name distribution within the Korean population 
shares characteristic features with the word-frequency distribution for an author: both 
Figure 5. $M$ as a function of time in table 1. 




Figure 6. Comparison between the actual family books (points) and RGF predictions 
(lines). 





Figure 7. Power-law exponent as a function of M. The solid line is obtained by 
analyzing the family books from 1900-1990, and the dotted line connects to the census 
data in 2000. 



are to good approximation described by a "meta-book" distribution $N(M)$ and both 
are well described by the RGF model [10, 6]. The "meta-book" distribution for an 
author gives a unique relation between a text of length $M$ written by an author and 
the number of distinct words used, $N$. In the Korean case, this corresponds to the 
number of distinct family names $N$ you typically find in a group of $M$ Koreans. In the 
word-frequency case, this leads to the conclusion that the number of occurrences of the 
most common word used by an author in an English text, 'the', is proportional to the 
total size of the text $M$. The 
corresponding conclusion for the Korean family names is that the number of carriers of 
the most common name, Kim, in a group of $M$ Koreans should on the average always be 
proportional to the size $M$. This prediction was checked with data ranging from 1510 AD to 2000 AD and 
group sizes in the interval $[33, 4.6 \times 10^7]$ and was found to be obeyed with high precision. 
It was argued that this is a nontrivial result for two reasons: first, it was shown that the 
rise and fall of individual families in general have no simple relation to the rise and fall 
of the total population; the only obvious relation is that the members of all the families 
collectively vary as the total population. Second, usual growth models, like the Simon 
model, predict that the size of the largest family grows more slowly than the population. 

The fact that the name distribution to good approximation appears to follow a 
unique $N(M)$ made it possible to estimate the size of the Korean culture down to 
year 500 AD, by using the statistics for the years in which 189 Korean family names 
were introduced. This estimate suggested that the total population was around $5 \times 10^4$ 
persons with Korean family names, of which about $10^4$ carried the family name Kim. 
The total number of family names at year 500 AD was predicted to be around 150. We 
think that these are fascinating conclusions, although we cannot judge the historical 
realism and implication of the ten thousand Kims. 

Finally, we demonstrated that the actual frequency distributions of Korean names follow 
the RGF distribution with a size-dependent power-law exponent $\gamma$ [6]. 

What do these results imply for the Korean culture? We speculate that the 
answer is stability. It seems that some core of the Korean culture has remained intact 
over at least 1500 years and as both the population and occupied area expanded, 
it basically swallowed other cultural influences without compromising its core. An 
interesting question is if this type of analysis could also be applied to other cultures. 
This, however, remains for the future. 